Executive Summary:

For the Spring 2023 semester of STAT 370 Data Science Consulting, we worked closely with a Senior Data Scientist at US Foods to create a model that predicts, and helps maximize, daily productivity in the company's warehouses. Given productivity metrics data, we transformed, analyzed, and selected features that contributed to the daily productivity rate called "cases-per-man-hour". We built multiple ensemble models and optimized them using grid searches. To better analyze and compare model performance and data trends, we also created and implemented a Plotly visualization.

Introduction:

US Foods is a leading foodservice distributor partnered with around 300,000 restaurants and foodservice operators across the country, with the goal of becoming their primary food distributor. To achieve this goal, the company must provide a wide variety of products to meet demand, and beyond physical warehouse capacity, it must know how much labor is needed. Labor demand is estimated using productivity metrics. The goal of this project is to create a daily productivity model for US Foods warehouses.

Warehouses are the central hubs for distributing goods to these partners, as they are in charge of packing, processing, and shipping cases. Each location measures many different metrics, such as the number of cases processed, the number of hours worked by specific work groups, and the number of items selected from each product zone. To measure daily productivity we focus on the variable "cases-per-man-hour" (CPMH): the total number of cases processed by a warehouse divided by the total hours worked. The higher this rate, the more cases are processed per hour of labor, which ultimately translates to higher productivity. We want to determine the factors that help predict and maximize this rate in order to accurately track and coordinate labor within warehouses.

Data Description:

As we are creating a model specifically to predict productivity in US Foods warehouses, the data we received were labor and production metrics from those warehouses. The data span a four-year period, from 2019 to early 2023, with 70 warehouses represented across three separate datasets. Before describing the datasets, it is important to clarify some common warehouse terms that are not generally known. First, a SKU, or stock-keeping unit, is a specific product identifier used by the manufacturer to track and manage inventory [1]; examples of SKUs could be 2% milk or fat-free cottage cheese. An each is related to a SKU: it refers to any individual item in a warehouse, so a gallon of milk is an each and a carton of eggs is an each [2]. A case, meanwhile, is a container of eaches that is either stored in or shipped from the warehouse [2]. Generally, a case contains only eaches of the same SKU, but a case sent to a small-scale customer who does not need enough of one product to fill a case may contain eaches from multiple SKUs; a small local bakery and a 2,000-student high school operate at very different scales in their ordering needs and quantities. With these terms defined, the features of the datasets should be easier to follow.

The first dataset, titled hours, contained information about the number of hours worked and cases processed per day, per warehouse, and per cohort: each observation records these quantities for a specific cohort at a specific warehouse on a specific day. (See Figure 1 for an overview of the raw datasets). Our client informed us that each warehouse worker is sorted into a cohort based on skill level and experience. There are three cohort levels: Cohort A contains the most skilled and experienced workers, Cohort B the next most, and Cohort C the least. For our specific task, our client told us to expect higher productivity from the more experienced cohorts. (See Section Preprocessing). The next dataset, entitled cases, included the number of eaches processed per warehouse, per zone, and per day. The zones are dry, frozen, and cooled, and they correspond to where each "each" came from (i.e., milk would be in the cooled zone while ice cream would be in the frozen zone). Finally, our last dataset, entitled sae, related to the labeling system of each warehouse and has three variables: warehouse, go-live date, and label type. Before its go-live date, a warehouse used a system of calling out the SKU of each product to coworkers so they knew which eaches to pick. After the go-live date, a new system was implemented whereby each SKU was given a unique barcode telling the warehouse worker which item to pick to pack a case. The label type variable indicates which specific barcode system was implemented in each warehouse.
Our client gave us this dataset because, for about a four-week period after the go-live date, we should expect a significant drop in productivity in each warehouse. It is important to note that each warehouse has a different go-live date.

Methods:

Preprocessing:

We initially wanted to use all of the data available to us to predict the productivity rate, but after some exploratory data analysis and conversations with our client, we unfortunately had to cut most of the data we were given and all of the data we had collected externally. Initially, we planned to use the cases and eaches information to predict a productivity metric that would represent how much product was getting processed in a particular warehouse on a particular day. However, the cases dataset (the one with the eaches information) did not contain any information about the number of hours worked, nor any cohort information that could have let us match it to the cohorts in the hours dataset. After speaking with our client, we concluded that with the data we had, there was no way to accurately determine how many hours were worked in the cases dataset. (See Figure 1 for an overview of all the datasets).

We also wanted to include information about the labeling systems in our models, as we were told it would significantly affect productivity. However, we did not have enough time to incorporate this data into our models. Although this information is not in our models, we show later how productivity and our predictions of productivity are affected by these labeling changes. (See Figure 12 for an example of this decline).

Therefore, the data we were left with came entirely from the hours dataset and included date, cohort, and branch information, which serve as our primary predictors in our modeling. With our main goal of creating a model that best predicts the cases-per-man-hour rate of a warehouse, our data required some transformations to create our response variable, as well as other variables associated with measuring that response. To calculate cases-per-man-hour, we grouped all observations by date and warehouse, then summed the hours and cases processed across cohorts. We then created the response variable by dividing the total number of cases by the total number of hours. This response variable, named consump, represents the case productivity rate for all cohorts during a specific day at a specific warehouse. To prevent the loss of the cohort information, we created new variables measuring the percentage of hours worked and cases processed by each cohort, computed by dividing the original hour and case numbers by the new totals. Finally, we converted the dates into more usable formats by extracting the day of the week and the month from each date and adding them as separate variables. Although we initially did more wrangling on our dataset, that work is not relevant to our final dataset, so it is not included in this report. (It is still included in our code, as seen in Appendix B). Our final dataset includes our response, cohort information (% of hours worked and cases processed), date information (both weekday and month), and branch information. (See Figure 2 for the final transformed dataset).
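As a concrete sketch, the grouping and feature construction described above can be expressed in pandas; the sample rows below are invented for illustration, and the column names mirror our datasets:

```python
import pandas as pd

# Hypothetical sample of the hours data: one row per warehouse, day, and cohort.
hours = pd.DataFrame({
    "BRNCH_CD": ["2Z", "2Z", "2Z"],
    "Date": pd.to_datetime(["2021-05-24"] * 3),
    "COHORT": ["A", "B", "C"],
    "MEASURED_DIRECT_HRS": [150.0, 30.0, 20.0],
    "CASES_SELECTED": [24000, 4000, 2000],
})

# Share of hours and cases contributed by each cohort on each warehouse-day.
totals = hours.groupby(["Date", "BRNCH_CD"])
hours["hrs_pct"] = hours["MEASURED_DIRECT_HRS"] / totals["MEASURED_DIRECT_HRS"].transform("sum")
hours["cases_pct"] = hours["CASES_SELECTED"] / totals["CASES_SELECTED"].transform("sum")

# Collapse to one row per warehouse-day and compute the response:
# consump = total cases / total hours (cases-per-man-hour).
daily = hours.groupby(["Date", "BRNCH_CD"], as_index=False).agg(
    total_hrs=("MEASURED_DIRECT_HRS", "sum"),
    total_cases=("CASES_SELECTED", "sum"),
)
daily["consump"] = daily["total_cases"] / daily["total_hrs"]

# Date features: day of week and month as separate variables.
daily["weekday"] = daily["Date"].dt.day_name()
daily["month"] = daily["Date"].dt.month_name()
```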

After we finished wrangling our data, we analyzed it and found some alarming trends that needed to be addressed before we moved on to model building. The first of these was the extreme outliers in our response variable: productivity rates ranged from zero cases-per-man-hour all the way to over 200,000. (See Figure 3). After conferring with our client, we learned that the maximum cases-per-man-hour should be around 300. We also saw a few thousand observations with fewer than ten total hours worked. (See Figure 4). This is when we realized how inconsistent the record-keeping behind our data was: we learned from our client that there was no standard method by which hours and cases were recorded in the warehouses, so there is no certainty that all of our data is accurate. This led us to drop all observations that were not feasible or realistic, which we defined as cases-per-man-hour rates above 300 and total hours worked below 10. (These cutoffs were approved by our client as reasonable for our data). This dropped about 4,000 observations and significantly reduced the outliers present in our data. (See Figure 5).
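A minimal sketch of the cutoff filtering, on invented rows (the 300 CPMH and 10-hour thresholds are the client-approved values above):

```python
import pandas as pd

# Hypothetical warehouse-day rows: consump is cases-per-man-hour,
# total_hrs the total recorded hours for that warehouse-day.
df = pd.DataFrame({
    "consump": [150.0, 250000.0, 90.0, 40.0],
    "total_hrs": [200.0, 5.0, 180.0, 2.0],
})

# Keep only feasible rows: rates at or below 300 CPMH
# and at least 10 recorded hours.
clean = df[(df["consump"] <= 300) & (df["total_hrs"] >= 10)]
```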

During the analysis described above, we also found some interesting information that helped inform how to proceed with our model building. First, we confirmed our client's assertion that we should see a difference in productivity rates between the cohorts: cohorts A and B are much more productive than C. (See Figure 6). Another discovery concerned the weekdays on which the warehouses were open. As we can see from Figure 7, both Saturday and Sunday have hardly any observations, and Fridays have significantly fewer than the other weekdays. This suggests that the warehouses are almost universally closed on Sundays, and almost all of them are closed on Saturdays as well. We used these findings to help create the models discussed in the next section.
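Tallying observations per weekday, as we did for Figure 7, is a one-liner in pandas; the dates below are hypothetical:

```python
import pandas as pd

# Hypothetical observation dates; counting by weekday exposes closure
# patterns such as the near-empty Saturdays and Sundays in Figure 7.
dates = pd.to_datetime([
    "2021-05-24", "2021-05-25", "2021-05-25",  # Monday, Tuesday x2
    "2021-05-28", "2021-05-30",                # Friday, Sunday
])
weekday_counts = pd.Series(dates).dt.day_name().value_counts()
```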

Models:

In total, we created eight models based on our final dataset: one linear regression, four Gradient Boosted Regression (GBR) models, one Extra Trees model, one Random Forest, and one decision tree. We utilized pipelines to standardize the process of transforming and splitting the data. The first step in preparing the data was to set a standard random seed to ensure that the data would be split the same way each time and that any random aspects of our models would stay constant. We then created two pipelines, one to scale the numeric variables and one to transform the categorical variables into a form usable by the model functions. All of our pipelines, scalers, and models come from the Scikit-Learn Python package. Once our data was in a suitable form, we moved on to building our models.
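The setup can be sketched as follows; the seed value, miniature dataset, and column subset here are illustrative, not our actual configuration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42  # illustrative; a fixed seed keeps splits and models reproducible

# Hypothetical miniature of the final dataset.
rng = np.random.default_rng(SEED)
X = pd.DataFrame({
    "a_hrspct": rng.random(40),
    "c_hrspct": rng.random(40),
    "brnch_cd": ["2Z", "4V"] * 20,
    "weekday": ["Monday", "Tuesday"] * 20,
})
y = 100 + 50 * X["a_hrspct"] - 30 * X["c_hrspct"]

# One branch scales numeric columns, the other one-hot encodes categoricals.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["a_hrspct", "c_hrspct"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brnch_cd", "weekday"]),
])
model = Pipeline([("pre", pre), ("gbr", GradientBoostingRegressor(random_state=SEED))])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```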

Since our response variable was a rate, we fit a linear regression to serve as a baseline. The first model we focused on from there was a Gradient Boosted Regressor. GBR is an ensemble method that creates many weak models (trees) whose predictions are combined for better performance [3]. We started by fitting our data to this model with its default parameters: a learning rate of 0.1, 100 estimators, and a max depth of 3. For clarity, the learning rate shrinks the contribution of each tree, the number of estimators is the number of trees created, and the max depth limits how many splits each tree makes [4]. Higher estimator counts and depths let the model fit the training data more closely, but we had to keep in mind that they also make the model more computationally expensive to fit. To tune this model, we performed a grid search that fit every possible combination of parameters on our data to find the best-performing set. We found that a learning rate of 0.3, 2000 estimators, and a max depth of 10 returned the best root mean squared error (RMSE). However, when we analyzed the fit of this model against the actual data, we noticed that it had very high variability, as seen in Figure 13. After researching what might cause high variability in GBR models, we found that the ideal max depth for these models is somewhere between 3 and 8 [5]. Keeping the learning rate and number of estimators the same but changing the depth to 6 yielded the same RMSE and score while reducing the variability and fitting the actual data much better, as can also be seen in Figure 13.
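The grid search can be sketched with scikit-learn's GridSearchCV; the synthetic data and the reduced grid below are illustrative, while our real search covered learning rates up to 0.3, up to 2000 estimators, and depths up to 10:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Tiny synthetic regression problem so the sketch runs quickly.
rng = np.random.default_rng(0)
X = rng.random((120, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.05, 120)

# Illustrative subset of the grid we searched.
param_grid = {
    "learning_rate": [0.1, 0.3],
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # select on (negated) RMSE
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_  # best combination found by the search
```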

In addition to the GBR, our client also suggested building an Extra Trees Regressor. An Extra Trees Regressor has a similar architecture to a random forest, but chooses its splits randomly instead of searching for the optimal ones, making it more computationally efficient [6]. We fit our data to this model with its default parameters: a max depth of None, meaning the nodes of a tree keep expanding until every leaf contains fewer than the minimum number of samples required to split a node [7], and 100 estimators. We were surprised to see performance nearly identical to our tuned GBR, making this the second-best model and the only other model worth mentioning in terms of performance. Due to the time we had spent tuning the GBR and fitting other models, such as a random forest and a decision tree, we did not perform a grid search to tune this model. However, given its performance with default parameters, we wonder whether a tuned version could have outperformed the GBR, which is one goal we hope to pursue in the future.
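A minimal sketch of fitting an Extra Trees Regressor with the default parameters described above, on invented data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data standing in for our feature matrix and response.
rng = np.random.default_rng(1)
X = rng.random((150, 4))
y = 5 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.02, 150)

# Defaults we used: 100 estimators, max_depth=None (trees grow until
# leaves are smaller than min_samples_split).
et = ExtraTreesRegressor(n_estimators=100, max_depth=None, random_state=1)
et.fit(X, y)
train_r2 = et.score(X, y)  # R^2 on the training data
```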

Results and Interpretation:

Our goal, as per our client, was to produce a model that could achieve an RMSE of less than 10, as that level of prediction accuracy would make it suitable for a production environment. Although we tried different models with various parameters, we were unable to create a model with the target RMSE. (See Figure 8 for the table of model metrics). By our determination, our best models were the tuned GBR and the Extra Trees Regressor, both with an RMSE of 15.95. Our other models, the Random Forest and decision tree, returned RMSEs of 16.50 and 22.05 respectively; with these models performing worse both score-wise and computationally, we prefer the GBR and Extra Trees. The comparison of the model prediction line against the actual data in Figure 14 shows that these two models fit the best. We believe that with more specific and accurate data, these models could have performed even better and brought us closer to the target RMSE.

In terms of our features, one interesting finding is the insignificance of the day of the week and the month. None of the models we built returned the day of the week or the month as an important feature, which contrasts with our initial belief based on our analysis. Looking at the most important predictors among our best two models, the specific warehouse, the percentage of hours worked by cohort C, and the percentage of cases processed by cohort A are consistently the most important predictors of cases-per-man-hour. We determined this by analyzing a permutation importance plot, which randomly shuffles the values within a feature, breaking the relationship between that feature and the response variable, and measures the feature's importance to the model by the resulting drop in model score. The results of our permutation importance plot can be viewed in Figure 9. From this, we infer that the specific branch is a big factor in predicting the cases-per-man-hour rate, as some warehouses may be bigger or employ more people, leading the rate to differ between warehouses. Cohort also appears to be an important factor in warehouse productivity, with the best workers likely coming from cohort A, based on our previous data analysis and what our client told us.
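The permutation procedure can be sketched with scikit-learn's permutation_importance; the data here are synthetic, with only the first feature informative:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where only feature 0 drives the response.
rng = np.random.default_rng(2)
X = rng.random((200, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.1, 200)

model = ExtraTreesRegressor(random_state=2).fit(X, y)

# Shuffle each feature in turn and measure the drop in model score;
# large mean drops mark important features.
result = permutation_importance(model, X, y, n_repeats=5, random_state=2)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```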

Conclusion and Future Work:

Reflecting on our work, an addition that would have benefited us greatly is more warehouse-specific data. Our client envisioned a model for daily warehouse productivity that was not location- or branch-specific. With the data we were given, and with our models returning branch code as an extremely important factor in predicting cases-per-man-hour, this goal could not be accomplished without more warehouse data: factors such as the size of the warehouse, the number of people working per cohort, the average number of cases processed per day, and anything else that could help predict productivity. Having this additional information would have helped us create a model more catered to daily productivity in warehouses. We would also have benefited from better and more consistent record-keeping, specifically fewer outliers and null values, as this would have allowed us to keep more observations in our data.

One goal we hope to complete in the future is incorporating the sae data that we were given. Due to time constraints, we were not able to work this data into our models. If we were to continue predicting past productivity, implementing it would be our next step toward improving our predictions.
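Although we did not implement this, one way the sae go-live dates could be joined to the daily data is sketched below; the four-week window, sample rows, and post_golive_window flag name are illustrative:

```python
import pandas as pd

# Hypothetical slices of the sae and daily warehouse data.
sae = pd.DataFrame({
    "brnch_cd": ["2G", "9U"],
    "GO_LIVE_DATE": pd.to_datetime(["2021-10-31", "2020-02-16"]),
})
daily = pd.DataFrame({
    "brnch_cd": ["2G", "2G", "9U"],
    "Date": pd.to_datetime(["2021-11-10", "2022-01-05", "2020-02-20"]),
})

# Attach each warehouse's go-live date, then flag the ~4-week window
# after it, where we expect the productivity drop.
merged = daily.merge(sae, on="brnch_cd", how="left")
days_since = (merged["Date"] - merged["GO_LIVE_DATE"]).dt.days
merged["post_golive_window"] = days_since.between(0, 28)
```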

Another goal we started and wish to accomplish in the future is to incorporate weather into our data, both to determine its effects on warehouse productivity and to predict warehouse closures. We had already integrated weather data through an API, adding daily features such as the amount of rain or snow, sunrise and sunset times, wind speeds, temperatures, and visibility based on the location of each warehouse. Warehouse closures have a dramatic effect on productivity, and being able to predict when one may occur based on weather could greatly help in compensating and adjusting productivity metrics. The weather data could also have helped with productivity metrics for deliveries and shipping, as this area of the business has the most direct contact with outside environmental conditions. Overall, we are happy with the work we have done given the data we had at our disposal.

References:

[1] https://www.e-grocery.com/industry-terms/stock-keeping-unit-sku/

[2] https://www.e-grocery.com/industry-terms/eaches/

[3] https://towardsdatascience.com/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502

[4] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

[5] https://inria.github.io/scikit-learn-mooc/python_scripts/ensemble_hyperparameters.html

[6] https://quantdare.com/what-is-the-difference-between-extra-trees-and-random-forest/#:~:text=In%20terms%20of%20computational%20cost,not%20calculate%20the%20optimal%20one.

[7] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

Appendix A:

Figure 1; Overview of Datasets

Overview of Cases Dataset
BRNCH_CD DIV_NM CLNDR_DT ZONE EACHES_SELECTED Date Total_Each_Day
242847 2N PITTSTON SYSTEMS 2017-01-30 DRY 7430 2017-01-29 18:00:00 18678
218065 6G ROANOKE 2017-01-30 DRY 18624 2017-01-29 18:00:00 51026
282919 9I OMAHA 2017-01-30 FRZ 5915 2017-01-29 18:00:00 18022
316762 3V INDIANAPOLIS 2017-01-30 FRZ 16004 2017-01-29 18:00:00 44477
36682 9D TAMPA 2017-01-30 CLR 19052 2017-01-29 18:00:00 64457
52242 4I PHOENIX 2017-01-30 FRZ 14410 2017-01-29 18:00:00 50567
Overview of Hours Dataset
DIV_NBR FULL_MARKET_NAME BRNCH_CD REPORT_DATE COHORT MEASURED_DIRECT_HRS CASES_SELECTED
3190 PHILADELPHIA - BRIDGEPORT (4V, 3190) 4V 2021-05-24 C 41.58778 3056
3190 PHILADELPHIA - BRIDGEPORT (4V, 3190) 4V 2019-10-07 A 151.69611 23783
3230 KANSAS CITY (6I, 3230) 6I 2021-10-27 A 276.62056 57192
2350 MINNESOTA (3F, 2350) 3F 2021-10-11 C 192.15584 19769
3310 BILLINGS (9J, 3310) 9J 2021-05-16 C 224.47278 23241
4146 PHOENIX (4I, 4146) 4I 2019-05-27 A 189.38194 29367
Overview of SAE Dataset
BRNCH_CD GO_LIVE_DATE LABEL_TYPE
2G 2021-10-31 DEMAND TANDEM WITH CARTS
9U 2020-02-16 MOBILE TEAR
9Q 2022-02-06 DEMAND TANDEM WITH CARTS
9P 2022-01-02 DEMAND TANDEM WITH CARTS
9O 2020-10-18 AUTO PEEL
9L 2020-03-22 AUTO PEEL

Figure 2: Final Dataset

brnch_cd a_hrspct c_hrspct a_cases c_cases cases_hrs weekday month
1 2Z 0.8633198 0.0652146 0.8574606 0.0674917 177.3933 Wednesday January
2 2Z 0.7855857 0.0608237 0.7897808 0.0661608 171.6771 Thursday January
4 2Z 0.8203791 0.0308109 0.8074854 0.0265296 169.0305 Monday January
5 2Z 0.8256355 0.0352773 0.8237783 0.0352853 166.4086 Tuesday January
6 2Z 0.8262748 0.0293957 0.8213931 0.0291916 160.4615 Wednesday January
7 2Z 0.8388019 0.0277680 0.8229432 0.0335182 169.3778 Thursday January

Figure 3: Histogram of Outliers in the Response Variable

Figure 4: Distribution of Hours Pre-Wrangling

Figure 5: Post-Wrangling Key Variable Distributions

Figure 6: Difference in Cases-per-Man-Hour by Cohort

Figure 7: Number of Observations by Weekday

Figure 8: Model Metrics

Model RMSE R2 Model Score
Base Linear Regression 20.88517 0.4617752 0.6495140
Base GBR 22.37675 -0.0550619 0.5976641
Improved GBR 18.29793 0.6909192 0.7309712
Improved GBR 2.0 16.35910 0.7404636 0.7849627
Improved GBR 3.0 15.94781 0.7431415 0.7956396
Xtra Tree 15.94812 0.7594577 0.7956315
RFR 16.50502 0.7248297 0.7811097
Decision Tree 22.04813 0.6117946 0.6093945

Figure 9: GBR Feature Importance

(Plot not reproduced here: boxplots of permutation importances for features c_cases, brnch_cd, a_hrspct, c_hrspct, a_cases, listed bottom to top.)

Figure 10: Xtra Trees Feature Importance

(Plot not reproduced here: boxplots of permutation importances for features c_cases, brnch_cd, c_hrspct, a_cases, a_hrspct, listed bottom to top.)

Figure 11: Random Forest Feature Importance

(Plot not reproduced here: boxplots of permutation importances for features c_cases, brnch_cd, c_hrspct, a_hrspct, a_cases, listed bottom to top.)

Figure 12: Productivity drop after Label Change

Figure 13: Comparing the Old and New GBR Models after Depth Change

Figure 14: Comparing the Best GBR Model and Xtra Trees Model

Appendix B:

Code

knitr::opts_chunk$set(include = F)
setwd("/home/risto/Nextcloud/stat_370/final_report")

library(reticulate)
library(kableExtra)
library(tidyverse)
library(plotly)

if(!virtualenv_exists(envname = "./usfoods_env")){
  install_python(version="3.10.11")
  virtualenv_create(envname = "./usfoods_env", version = "3.10.11", packages = c("pandas==1.5", "numpy", "matplotlib", "scikit-learn==1.1.3", "joblib"))
  
}
use_virtualenv(virtualenv = "./usfoods_env", required = TRUE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
"""# Cases
* Change date to date_time object
    * Create a new Date column for both datasets
    * Allows merging on shared column
*  Sort dates in ascending order
"""
sae = pd.read_csv('data/sae.csv')
cases = pd.read_csv("data/Cases.csv")

# change CLNDR_DT from an object to a datetime and store it in a new 'Date'
# column so both datasets share a common column for merging

cases['Date'] = pd.to_datetime(cases['CLNDR_DT'])

#sort dates from past to current
cases = cases.sort_values(by='CLNDR_DT')

## drop rows with NaN so 'Total_Each_Day' is summed over complete records only

cases = cases.dropna()

## total eaches selected per day per branch, attached to every row
cases['Total_Each_Day'] = cases.groupby(['Date', 'BRNCH_CD'])['EACHES_SELECTED'].transform('sum')

## new dataframe with columns in order: Date, branch, zone, eaches selected, total eaches
cases4 = pd.DataFrame(cases, columns = ['Date', 'BRNCH_CD', 'ZONE', 'EACHES_SELECTED', 'Total_Each_Day'])

## reset the row index, dropping the old one so it is not added as a column
cases4 = cases4.reset_index(drop=True)

## ratio of each row's eaches to that day's branch total
cases4['ratio'] = cases4.EACHES_SELECTED / cases4.Total_Each_Day


## create a list for every zone (dry, clr, frz)
## for every row in dataframe:
    ## if the zone of the row == "DRY":
        ## add the ratio at index i to the "DRY" list
        ## add zeros to the other lists (ClR, FRZ) (place holder)
    ## if zone of row == "CLR":
        ## add ratio to CLR list
        ## add zeros to other lists
    ## repeat for FRZ

## add lists as columns to dataset

dry_ratio = []
clr_ratio = []
frz_ratio = []

for i in range(len(cases4)): 
  if cases4.ZONE[i] == 'DRY':
    dry_ratio.append(cases4.ratio[i])
    clr_ratio.append(0)
    frz_ratio.append(0)
  elif cases4.ZONE[i] == 'FRZ':
    dry_ratio.append(0)
    clr_ratio.append(0)
    frz_ratio.append(cases4.ratio[i])
  elif cases4.ZONE[i] == 'CLR':
    dry_ratio.append(0)
    clr_ratio.append(cases4.ratio[i])
    frz_ratio.append(0)
  else:
    # unexpected zone label: append zeros so the lists stay aligned with the dataframe
    dry_ratio.append(0)
    clr_ratio.append(0)
    frz_ratio.append(0)


cases4['dry_ratio'] = dry_ratio
cases4['clr_ratio'] = clr_ratio
cases4['frz_ratio'] = frz_ratio

## sort dataframe by date and branch to see everything together
    ## check if math is correct
cases4 = cases4.sort_values(by=['Date', 'BRNCH_CD'], ascending = [True, True])

## smoosh multiple rows of same branch and date into 1 row
    ## aggregate sums, zeros as placeholders will have no effect on ratio
cases6 = cases4.groupby(['Date', 'BRNCH_CD', 'Total_Each_Day'], as_index = False).agg('sum')


## drop ZONE (the string column gets concatenated by the sum), eaches (same as
## total after aggregation), and ratio (the ratios add up to 1 anyway)
cases_EachPerDay = cases6.drop(columns=['ZONE', 'EACHES_SELECTED', 'ratio'])
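The placeholder-zeros-then-sum pattern used above to widen the zone ratios is one way to do it; `pivot_table` expresses the same reshaping more directly. A sketch on toy data (the values and codes are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2019-01-02"] * 3,
    "BRNCH_CD": ["2Z"] * 3,
    "ZONE": ["DRY", "CLR", "FRZ"],
    "ratio": [0.5, 0.3, 0.2],
})

# one row per Date/branch, one column per zone; missing zones become 0
wide = toy.pivot_table(index=["Date", "BRNCH_CD"], columns="ZONE",
                       values="ratio", fill_value=0).reset_index()
print(wide[["CLR", "DRY", "FRZ"]].iloc[0].tolist())  # [0.3, 0.5, 0.2]
```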

"""# Hours


*   Repeat steps from cases


"""

hours = pd.read_csv("data/hours.csv")

hours['Date'] = pd.to_datetime(hours['REPORT_DATE'])

hours = hours.sort_values(by=['REPORT_DATE', 'BRNCH_CD'], ascending = [True, True])

hours = hours.drop(columns=['DIV_NBR', 'REPORT_DATE', 'FULL_MARKET_NAME'])

## drop "NaN" (UNKOWN)
hours = hours[hours.COHORT != 'UNKNOWN']

## create datarame in order desired
hours_co = pd.DataFrame(hours, columns = ['Date', 'BRNCH_CD', 'COHORT', 'CASES_SELECTED', 'MEASURED_DIRECT_HRS'])

## group by Date and branch to get sum of HOURS MEASURED of each cohort as total hours worked at branch on specific day
hours_co['Total_Hours'] = pd.DataFrame(hours.groupby(['Date', 'BRNCH_CD'])['MEASURED_DIRECT_HRS'].transform('sum'))


## group by Date and branch to get sum of CASES SELECTED of each cohort as total cases selected at branch on specific day
hours_co['Total_Cases'] = pd.DataFrame(hours.groupby(['Date', 'BRNCH_CD'])['CASES_SELECTED'].transform('sum'))


hours_co = hours_co.reset_index(drop = True)

## as with cases, get ratio of hours worked per cohort relative to total hours
## measured at branch on specific day (vectorized division)
hours_co['Hrs_Pct'] = hours_co['MEASURED_DIRECT_HRS'] / hours_co['Total_Hours']

## same process as the zone ratios: place each cohort's hours share in its own column
    ## 0.0 = placeholder for aggregation
A_Hours = []
B_Hours = []
C_Hours = []

for i in range(len(hours_co)):
  if hours_co.COHORT[i] == 'A':
    A_Hours.append(hours_co.Hrs_Pct[i])
    B_Hours.append(0)
    C_Hours.append(0)
  elif hours_co.COHORT[i] == 'B':
    A_Hours.append(0)
    B_Hours.append(hours_co.Hrs_Pct[i])
    C_Hours.append(0)
  elif hours_co.COHORT[i] == 'C':
    A_Hours.append(0)
    B_Hours.append(0)
    C_Hours.append(hours_co.Hrs_Pct[i])
  else:
    # unexpected cohort label: append zeros so the lists stay aligned with the dataframe
    A_Hours.append(0)
    B_Hours.append(0)
    C_Hours.append(0)

hours_co['A_HrsPct'] = A_Hours
hours_co['B_HrsPct'] = B_Hours
hours_co['C_HrsPct'] = C_Hours

## ratio of cases selected per cohort relative to the branch/day total
hours_co['Cases_Pct'] = hours_co['CASES_SELECTED'] / hours_co['Total_Cases']

## cases ratio per cohort w/ placeholders
A_Cases = []
B_Cases = []
C_Cases = []

for i in range(len(hours_co)):
  if hours_co.COHORT[i] == 'A':
    A_Cases.append(hours_co.Cases_Pct[i])
    B_Cases.append(0)
    C_Cases.append(0)
  elif hours_co.COHORT[i] == 'B':
    A_Cases.append(0)
    B_Cases.append(hours_co.Cases_Pct[i])
    C_Cases.append(0)
  elif hours_co.COHORT[i] == 'C':
    A_Cases.append(0)
    B_Cases.append(0)
    C_Cases.append(hours_co.Cases_Pct[i])
  else:
    # unexpected cohort label: append zeros so the lists stay aligned with the dataframe
    A_Cases.append(0)
    B_Cases.append(0)
    C_Cases.append(0)


hours_co['A_Cases'] = A_Cases
hours_co['B_Cases'] = B_Cases
hours_co['C_Cases'] = C_Cases

hours_CoHrs = hours_co.groupby(['Date', 'BRNCH_CD', 'Total_Hours', 'Total_Cases'], as_index = False).agg('sum')
hours_CoHrs

"""# New Datasets"""

## Date = CLNDR_DATE
## BRNCH_CD = code referring to specific Warehouse

## Total_Each_Day = sum of eaches selected by all zones on specific date
    ## dry_ratio = proportion of dry products relative to total eaches selected on specific date 
    ## clr_ratio = proportion of cooler products relative to total eaches selected on specific date
    ## frz_ratio = proportion of freezer products relative to total eaches selected on specific date

hours_CohortHrsCases = pd.DataFrame(hours_CoHrs, columns = ['Date', 'BRNCH_CD', 
                                                            'Total_Hours', 'Total_Cases',
                                                            'A_HrsPct',
                                                            'B_HrsPct', 'C_HrsPct', 
                                                            'A_Cases', 'B_Cases','C_Cases' 
                                                            ])
## Date = REPORT_DATE
## BRNCH_CD = code referring to specific Warehouse

## Total_Hours = sum of hours worked by all cohorts on specific date
    ## A_HrsPct = proportion of hours worked by Cohort A relative to total hours on specific date
    ## B_HrsPct = proportion of hours worked by Cohort B relative to total hours on specific date
    ## C_HrsPct = proportion of hours worked by Cohort C relative to total hours on specific date

## Total_Cases = Sum of all cases selected by all cohorts on specific date
    ## A_Cases = proportion of cases selected by Cohort A relative to total cases selected on specific date
    ## B_Cases = proportion of cases selected by Cohort B relative to total cases selected on specific date
    ## C_Cases = proportion of cases selected by Cohort C relative to total cases selected on specific date

"""# Merge
* lost a lot of data :/
* dont merge yet?
"""

Foods = hours_CohortHrsCases.merge(cases_EachPerDay, how = 'inner', on = ['Date', 'BRNCH_CD'])
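Before committing to the inner merge, `indicator=True` on an outer merge shows exactly how many rows match and which side the unmatched ones come from; that diagnoses the "lost a lot of data" concern. A sketch with toy frames (names and values are hypothetical):

```python
import pandas as pd

# toy frames: one branch/day pair appears only on the left
left = pd.DataFrame({"Date": ["d1", "d1", "d2"],
                     "BRNCH_CD": ["2Z", "4C", "2Z"],
                     "hrs": [1, 2, 3]})
right = pd.DataFrame({"Date": ["d1", "d1"],
                      "BRNCH_CD": ["2Z", "4C"],
                      "eaches": [10, 20]})

# _merge tags each row's provenance, showing what an inner merge would drop
check = left.merge(right, how="outer", on=["Date", "BRNCH_CD"], indicator=True)
print(check["_merge"].value_counts().to_dict())
# {'both': 2, 'left_only': 1, 'right_only': 0}
```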

#work on a copy before adding dates and merging with the sae dataset
Foods2 = Foods.copy()

#adding the response variable(s) to the dataset
Foods2["cases_hrs"] = Foods2['Total_Cases']/Foods2['Total_Hours'] #creates the dependent variable we are trying to measure
Foods2.replace([np.inf, -np.inf], np.nan, inplace=True) #converting all infinites to na's. will want to come up with a better solution later 
Foods2 = Foods2.dropna() #drops all na's for now
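The infinities being replaced above come from rows where `Total_Hours` is 0: NumPy float division by zero yields `inf` rather than raising, so the ratio silently produces infinite CPMH values. A minimal illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Total_Cases": [100.0, 50.0], "Total_Hours": [8.0, 0.0]})
toy["cases_hrs"] = toy["Total_Cases"] / toy["Total_Hours"]  # second row becomes inf

# same cleanup as the main pipeline: inf -> NaN, then droppable
toy.replace([np.inf, -np.inf], np.nan, inplace=True)
print(toy["cases_hrs"].isna().tolist())  # [False, True]
```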

weekdays = {0: "Monday",
            1: "Tuesday",
            2: "Wednesday",
            3: "Thursday",
            4: "Friday",
            5: "Saturday",
            6: "Sunday"}

months =   {1: "January",
            2: "February",
            3: "March",
            4: "April",
            5: "May",
            6: "June",
            7: "July",
            8: "August",
            9: "September",
            10: "October",
            11: "November",
            12: "December"}
            
Foods2['weekday'] = Foods2['Date'].apply(lambda x: weekdays[x.weekday()]) #adds the weekday of the particular entry 
Foods2['month'] = Foods2['Date'].apply(lambda x: months[x.month]) #add the reporting_date month as a separate variable
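The dictionary-plus-`apply` approach above works; pandas' `.dt` accessor offers a vectorized equivalent that avoids a per-row Python call and the hand-written lookup tables:

```python
import pandas as pd

toy = pd.DataFrame({"Date": pd.to_datetime(["2023-01-02", "2023-07-04"])})

# vectorized alternatives to the apply/dict approach
toy["weekday"] = toy["Date"].dt.day_name()
toy["month"] = toy["Date"].dt.month_name()
print(toy["weekday"].tolist())  # ['Monday', 'Tuesday']
print(toy["month"].tolist())    # ['January', 'July']
```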


#adds the sae data to the main dataset
sae = pd.read_csv('data/sae.csv')

Foods3 = Foods2.merge(sae, how='outer', on='BRNCH_CD')
import random

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor


random_seed = 46
np.random.seed(random_seed)
random.seed(random_seed)

dataset = Foods3

dataset = dataset[dataset["cases_hrs"] <= 300]
dataset = dataset[dataset["Total_Hours"] >= 10]

#dropping all of the unneeded columns
to_drop = ["Date", "Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "GO_LIVE_DATE", "LABEL_TYPE"]
dataset_build = dataset.drop(labels=to_drop, axis=1)
dataset_build.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)


#pipelines
numeric_features = ["a_hrspct",  'c_hrspct', 'a_cases', 'c_cases']
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)
categorical_features=['brnch_cd', 'weekday', 'month']
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder())
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(dataset_build.drop(labels="cases_hrs", axis=1),dataset_build['cases_hrs'], random_state=random_seed, train_size = .70)

scores = []
#linear regression
reg_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", LinearRegression())]
)
reg_pipe.fit(X_train, y_train)
scores.append(["Reg", reg_pipe.score(X_test, y_test)])

#gradient boosting regressor model
gbr_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(random_state=random_seed))]
)
gbr_pipe.fit(X_train, y_train)
scores.append(["GBR Init", gbr_pipe.score(X_test, y_test)])

#improved gbr
gbr_improved_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1, n_estimators=1000, learning_rate=0.3, max_depth=10, random_state=random_seed, loss='squared_error'))]
)
gbr_improved_pipe.fit(X_train, y_train)
scores.append(["GBR Improved", gbr_improved_pipe.score(X_test, y_test)])

#improved gbr 2.0 - lower depth
gbr_improved_pipe_2 = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1, n_estimators=1000, learning_rate=0.3, max_depth=5, random_state=random_seed, loss='squared_error'))]
)
gbr_improved_pipe_2.fit(X_train, y_train)
scores.append(["GBR Improved", gbr_improved_pipe_2.score(X_test, y_test)])

#improved gbr 3.0
    ## to reduce variability, depth for gbr should be between 4 - 8
    ## n_estimators = 2000
    ## smaller learning rate = 0.05 - 0.15 is good
    ## ~4 minutes to fit
gbr_improved_pipe_3 = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", GradientBoostingRegressor(verbose=1,
                                                                               n_estimators=2000,
                                                                               learning_rate=0.05,
                                                                               max_depth=6,
                                                                               random_state=random_seed,
                                                                               loss='squared_error'))]
)
gbr_improved_pipe_3.fit(X_train, y_train)
scores.append(["GBR Improved 3",gbr_improved_pipe_3.score(X_test, y_test)])

#Extra trees base model, ~5 minutes to fit
xtratree_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", ExtraTreesRegressor(verbose=1, random_state=random_seed))]
)
xtratree_pipe.fit(X_train, y_train)
scores.append(["xtratree",xtratree_pipe.score(X_test, y_test)])

#baseline RFR, outperformed by extratrees regressor
randfor_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", RandomForestRegressor(verbose=1, random_state=random_seed))]
)
randfor_pipe.fit(X_train, y_train)
scores.append(["RFR", randfor_pipe.score(X_test, y_test)])

#baseline Decision tree regressor
dectree_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", DecisionTreeRegressor(random_state=random_seed))]
)
dectree_pipe.fit(X_train, y_train)
scores.append(["Decision Tree",dectree_pipe.score(X_test, y_test)])

#random forest grid search
from sklearn.model_selection import GridSearchCV, KFold

crossvalidation = KFold(n_splits=3, shuffle=True, random_state=1)
## grid-search the forest inside the same preprocessing pipeline, since the raw
## frame still holds categorical columns the regressor cannot consume directly
randfor_param = {
             'model__max_depth': [5, 10, 15],
             'model__n_estimators': [500, 1000, 1500]}
randfor_search = GridSearchCV(
    Pipeline(steps=[("preprocessor", preprocessor), ("model", RandomForestRegressor(random_state=random_seed))]),
    randfor_param, refit=True, verbose=3, cv=crossvalidation,
    scoring='neg_mean_squared_error')
randfor_search.fit(X_train, y_train)
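Once the refit search finishes, the fitted `GridSearchCV` object exposes the winning hyperparameters and the cross-validated score. A self-contained sketch on a tiny synthetic problem (the toy data and grid are illustrative, not the project's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# tiny synthetic regression problem so the example runs standalone
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    {"max_depth": [3, 6]},
    cv=cv,
    scoring="neg_mean_squared_error",
)
search.fit(X_toy, y_toy)

print(search.best_params_)  # the winning grid cell, e.g. {'max_depth': 6}
print(-search.best_score_)  # mean CV squared error of the best cell (sign flipped)
```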
from joblib import load
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import set_config
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

#setting the random seed
random_seed = 46
np.random.seed(random_seed)
random.seed(random_seed)

#loading in the models and datasets
reg = load("models/lin_reg.joblib")
gbr_init = load("models/gbr_init.joblib")
gbr_imp = load("models/gbr_improved.joblib")
reg_golive = load("models/lin_reg_golive.joblib") 
gbr_init_golive =  load("models/gbr_init_golive.joblib")
gbr_imp_golive = load("models/gbr_improved_golive.joblib")
gbr_imp_onewarehouse = load("models/gbr_improved_onewarehouse.joblib")
gbr_imp_2 = load("models/gbr_improved_2.joblib")
gbr_imp_3 = load("models/gbr_improved_3.joblib")
xtratree = load("models/xtra_tree.joblib")
randfor = load("models/rfr.joblib")
dectree = load("models/dec_tree.joblib")

dataset=dataset_build

models = []

#testing the models 

X_train, X_test, y_train, y_test = train_test_split(dataset.drop(labels="cases_hrs", axis=1),dataset['cases_hrs'], random_state=random_seed, train_size = .70)

X = X_test
y = y_test

#note: mean_squared_error and r2_score take (y_true, y_pred) in that order

#reg model
reg_pred = reg.predict(X)
models.append(["Base Linear Regression", mean_squared_error(y, reg_pred, squared=False), r2_score(y, reg_pred), reg.score(X, y)])

#gbr1 model
gbr_init_pred = gbr_init.predict(X)
models.append(["Base GBR", mean_squared_error(y, gbr_init_pred, squared=False), r2_score(y, gbr_init_pred), gbr_init.score(X, y)])

#improved gbr model
gbr_imp_pred = gbr_imp.predict(X)
models.append(["Improved GBR", mean_squared_error(y, gbr_imp_pred, squared=False), r2_score(y, gbr_imp_pred), gbr_imp.score(X, y)])

#improved gbr model 2.0
gbr_imp_2_pred = gbr_imp_2.predict(X)
models.append(["Improved GBR 2.0", mean_squared_error(y, gbr_imp_2_pred, squared=False), r2_score(y, gbr_imp_2_pred), gbr_imp_2.score(X, y)])

#improved gbr model 3.0
gbr_imp_3_pred = gbr_imp_3.predict(X)
models.append(["Improved GBR 3.0", mean_squared_error(y, gbr_imp_3_pred, squared=False), r2_score(y, gbr_imp_3_pred), gbr_imp_3.score(X, y)])

#xtra tree model
xtratree_pred = xtratree.predict(X)
models.append(["Xtra Tree", mean_squared_error(y, xtratree_pred, squared=False), r2_score(y, xtratree_pred), xtratree.score(X, y)])

#RFR model
randfor_pred = randfor.predict(X)
models.append(["RFR", mean_squared_error(y, randfor_pred, squared=False), r2_score(y, randfor_pred), randfor.score(X, y)])

#decision tree model
dectree_pred = dectree.predict(X)
models.append(["Decision Tree", mean_squared_error(y, dectree_pred, squared=False), r2_score(y, dectree_pred), dectree.score(X, y)])

#model comparison table for the presentation
models_pres = pd.DataFrame(models, columns=["Model", "RMSE", "R2", "Model Score"])

gbr_result = permutation_importance(
    gbr_imp_3, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)


xtratree_result = permutation_importance(
    xtratree, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)


randfor_result = permutation_importance(
    randfor, X_test, y_test, n_repeats=5, random_state=42, n_jobs=1
)

#!! pandas import/data wrangling for presentation stuff
master_data = pd.read_csv("data/master_dataset.csv")
hours = pd.read_csv("data/hours.csv")

gbr_model_old = load("models/gbr_improved.joblib")
gbr_model = load("models/gbr_improved_3.joblib")
xtratree_model = load("models/xtra_tree.joblib")
# randfor_model = load("models/rfr.joblib")

#!! model stuff
model_data = master_data[master_data["cases_hrs"] <= 300]
model_data = model_data[model_data["Total_Hours"] >= 10]

#dropping all of the redundant columns
to_drop = ["Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "LABEL_TYPE"]
model_data.drop(labels=to_drop, axis=1, inplace=True)
model_data.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)
dates = model_data.pop("Date")
go_live = model_data.pop("GO_LIVE_DATE")

#removing the real values
y = model_data.pop('cases_hrs')
X = model_data


# 1=gbr; 2=xtratree; 3=randfor; 4=old gbr
def model_pred(model):
    #replacing the actual cpmh values with predicted ones from the model
    df_pred = X.copy()  #copy so repeated calls don't mutate the shared feature frame
    if(model == 1):
        df_pred['cases_hrs_pred'] = gbr_model.predict(X)
    if(model == 2):
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if(model == 3):
        #randfor_model is not loaded above, so fall back to the extra trees model
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if(model == 4):
        df_pred['cases_hrs_pred'] = gbr_model_old.predict(X)
    df_pred['cases_hrs_real'] = y
    df_pred['Date'] = dates
    df_pred["GO_LIVE_DATE"] = go_live
    df_pred['Date'] = pd.to_datetime(df_pred['Date'])
    return df_pred

#!! creating a dictionary that will allow the branch names to appear in the dropdown menu
div_nm = hours['FULL_MARKET_NAME'].unique()
brnch = model_data['brnch_cd'].unique()
branches = {}
for b in brnch:
    for d in div_nm:
        if b in d:
            branches[b] = d

knitr::opts_chunk$set(include = T, echo = F, message = F, warning = F)


foods = py$Foods3
model_data = py$dataset
hours = py$hours

kable(head(py$cases), caption="Overview of Cases Dataset")
kable(head(py$hours), caption="Overview of Hours Dataset")
kable(head(py$sae), caption="Overview of SAE Dataset")
kable(head(py$dataset_build), caption="Overview of the Modeling Dataset")
#code about the outliers in the response variable
foods$consump = foods$Total_Cases/foods$Total_Hours
hist(foods$consump, main="Distribution of Cases-per-Man-Hour Pre-Wrangling")
#code about the total hours being extremely low
ggplot(data=foods, aes(x=Total_Hours)) + geom_histogram() + ggtitle("Distribution of Total Hours Pre-Wrangling")

#code about the differences in the datasets after dropping the outliers
food.less = foods[foods$Total_Hours >= 10,]
food.less = food.less[food.less$consump <= 300, ]

ggplot(data=food.less, aes(x=Total_Hours)) + geom_histogram() + ggtitle("Distribution of Total Hours Post-Wrangling")

ggplot(data=food.less, aes(x=consump)) + geom_histogram() + ggtitle("Distribution of Cases-per-Man-Hours Post-Wrangling")

#code about the difference in productivity rates between the cohorts 
hours.consump = hours %>% dplyr::mutate(consump = CASES_SELECTED / MEASURED_DIRECT_HRS) %>% filter(MEASURED_DIRECT_HRS > 10) %>% filter(consump < 300)

ggplot(data=hours.consump, aes(y=consump, x=COHORT)) + geom_boxplot() + ggtitle("Distribution of Cases-per-Man-Hour by Cohort Post-Wrangling")
barplot(table(foods$weekday), main="Number of Observations by Weekday")
model.pres = py$models_pres 
kable(model.pres)
#gbr 3 Feature importance / Permutation importance plot
## MDI importances live on the transformed (one-hot expanded) features, so
## label them with the preprocessor's output names, not X_train.columns
feature_names = np.array(gbr_imp_3.named_steps["preprocessor"].get_feature_names_out())
feature_importance = gbr_imp_3.named_steps["model"].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  #top 5 features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, feature_names[sorted_idx])
plt.title("Feature Importance (MDI)")

#permutation importance is computed on the raw columns, so X_train.columns applies
sorted_idx = gbr_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    gbr_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_train.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#Xtratree Feature importance / Permutation importance plot
feature_names = np.array(xtratree.named_steps["preprocessor"].get_feature_names_out())
feature_importance = xtratree.named_steps["model"].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  #top 5 features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, feature_names[sorted_idx])
plt.title("Feature Importance (MDI)")

sorted_idx = xtratree_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    xtratree_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_train.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#Random Forest Feature importance / Permutation importance plot
feature_names = np.array(randfor.named_steps["preprocessor"].get_feature_names_out())
feature_importance = randfor.named_steps["model"].feature_importances_
sorted_idx = np.argsort(feature_importance)[-5:]  #top 5 features
pos = np.arange(sorted_idx.shape[0]) + 0.5
fig = plt.figure(figsize=(8, 6))
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, feature_names[sorted_idx])
plt.title("Feature Importance (MDI)")

sorted_idx = randfor_result.importances_mean.argsort()[-5:]
plt.subplot(1, 2, 2)
plt.boxplot(
    randfor_result.importances[sorted_idx].T,
    vert=False,
    labels=np.array(X_train.columns)[sorted_idx],
)
plt.title("Permutation Importance (test set)")
fig.tight_layout()
plt.show()
#productivity drop after label change
vline <- function(x = 0, color = "red") {
  list(
    type = "line",
    y0 = 0,
    y1 = 1,
    yref = "paper",
    x0 = x,
    x1 = x,
    line = list(color = color)
  )
}

data = py$model_pred(0)  #0 matches no model code, so only the real values are attached
data = data[data$brnch_cd == "2Z",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date =  as.Date(data$Date)

branches = py$branches

p = plot_ly(x=data$Date,
            y=data$cases_hrs_real, type="scatter", mode="lines", name="real") %>%
  layout(title=paste("Productivity for Branch", branches["2Z"]),
         shapes=list(vline(data$GO_LIVE_DATE[1]))) %>%
  add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] + 100), y = c(250),
           text = c("Go-Live Date"))

p

#code for comparing the new and old gbr models
data = py$model_pred(4)
newn = py$model_pred(1)

data$new_predict = newn$cases_hrs_pred

data = data[data$brnch_cd == "4C",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date =  as.Date(data$Date)

  
p = plot_ly(x=data$Date,
            y=data$cases_hrs_real, type="scatter", mode="lines", name="real") %>%
  add_trace(y = ~data$new_predict, name = 'new model', mode = 'lines') %>%
  add_trace(y = ~data$cases_hrs_pred, name = 'old model', mode = 'lines') %>%
  layout(title=paste("Productivity for Branch", branches["4C"]),
         shapes=list(vline(data$GO_LIVE_DATE[1]))) %>%
  add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] + 100), y = c(250),
           text = c("Go-Live Date"))

p
#code for comparing the best gbr model, the extra trees model, and actual values
xtra = py$model_pred(2)
data = py$model_pred(1)
data$xtra = xtra$cases_hrs_pred
data = data[data$brnch_cd == "3Y",]
data$GO_LIVE_DATE = as.Date(data$GO_LIVE_DATE)
data$Date =  as.Date(data$Date)

  
p = plot_ly(x=data$Date,
            y=data$cases_hrs_real, type="scatter", mode="lines", name="Real") %>%
  add_trace(y = ~data$cases_hrs_pred, name = 'GBR', mode = 'lines') %>%
  add_trace(y = ~data$xtra, name = 'Xtra Trees', mode = 'lines') %>%
  layout(title=paste("Productivity for Branch", branches["3Y"]),
         shapes=list(vline(data$GO_LIVE_DATE[1]))) %>%
  add_text(showlegend = F, x = c(data$GO_LIVE_DATE[1] + 100), y = c(250),
           text = c("Go-Live Date"))

p
from dash import Dash, dcc, html, Input, Output
import pandas as pd
import plotly.graph_objects as go
from joblib import load
from datetime import datetime

#create a dashboard that shows these graphs by each warehouse

TESTING = True  #assumed flag: True = run locally from the repo paths, False = dockerized deployment paths

#!! pandas import/data wrangling for presentation stuff
#load data and models once, inside the branch that matches the environment
if TESTING:
    master_data = pd.read_csv("../../notebooks/data/master_dataset.csv")
    hours = pd.read_csv("../../notebooks/data/hours.csv")
    gbr_model_old = load("../../notebooks/models/gbr_improved.joblib")
    gbr_model = load("../../notebooks/models/gbr_improved_3.joblib")
    xtratree_model = load("../../notebooks/models/xtra_tree.joblib")
    # randfor_model = load("../../notebooks/models/rfr.joblib")
else:
    master_data = pd.read_csv("/app/notebooks/data/master_dataset.csv")
    gbr_model = load("/app/notebooks/models/gbr_improved_dock.joblib")
    hours = pd.read_csv("/app/notebooks/data/hours.csv")
    #note: gbr_model_old and xtratree_model still need docker-side paths here

#!! model stuff
model_data = master_data[master_data["cases_hrs"] <= 300]
model_data = model_data[model_data["Total_Hours"] >= 10]

#dropping all of the redundant columns
to_drop = ["Total_Hours", "Total_Cases", "B_HrsPct", "B_Cases", "Total_Each_Day", "dry_ratio", "clr_ratio", "frz_ratio", "LABEL_TYPE"]
model_data.drop(labels=to_drop, axis=1, inplace=True)
model_data.rename(columns={"BRNCH_CD":"brnch_cd", "A_HrsPct":"a_hrspct", "C_HrsPct":"c_hrspct", "A_Cases":"a_cases", "C_Cases":"c_cases"}, inplace=True)
dates = model_data.pop("Date")
go_live = model_data.pop("GO_LIVE_DATE")

#removing the real values
y = model_data.pop('cases_hrs')
X = model_data


# 1=gbr; 2=xtratree; 3=randfor; 4=old gbr
def model_pred(model):
    #replacing the actual cpmh values with predicted ones from the model
    df_pred = X.copy()  #copy so repeated calls don't mutate the shared feature frame
    if(model == 1):
        df_pred['cases_hrs_pred'] = gbr_model.predict(X)
    if(model == 2):
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if(model == 3):
        #randfor_model is not loaded above, so fall back to the extra trees model
        df_pred['cases_hrs_pred'] = xtratree_model.predict(X)
    if(model == 4):
        df_pred['cases_hrs_pred'] = gbr_model_old.predict(X)
    df_pred['cases_hrs_real'] = y
    df_pred['Date'] = dates
    df_pred["GO_LIVE_DATE"] = go_live
    df_pred['Date'] = pd.to_datetime(df_pred['Date'])
    return df_pred

#!! creating a dictionary that will allow the branch names to appear in the dropdown menu
div_nm = hours['FULL_MARKET_NAME'].unique()
brnch = model_data['brnch_cd'].unique()
branches = {}
for b in brnch:
    for d in div_nm:
        if b in d:
            branches[b] = d


#!! dash stuff
layout = html.Div([
        html.H1("Warehouse Analytics"),
            html.Div([
            html.H3("Predicted Productivity vs Actual Productivity over time per Branch using GBR"),
            dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_gbr'),
            dcc.Graph(id="pred_real_gbr"),
        ]),
        html.Div([
            html.H3("Predicted Productivity vs Actual Productivity over time per Branch using XtraTrees"),
            dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_xtra'),
            dcc.Graph(id="pred_real_xtra"),
        ]),
        html.Div([
            html.H3("Predicted Productivity vs Actual Productivity over time per Branch using Old GBR"),
            dcc.Dropdown(branches, "2Z", id='dropdown-pred-real_oldgbr'),
            dcc.Graph(id="pred_real_oldgbr"),
        ])    
    ])

if TESTING:
    app = Dash(__name__)
    app.layout = layout 

"""Graphing real values over predicted values using GBR"""
@app.callback(
    Output("pred_real_gbr", "figure"),
    Input("dropdown-pred-real_gbr", "value"),
)
def display_gbr(value):
    df = model_pred(1)
    df = df[df['brnch_cd'] == value]
    fig = go.Figure([
        go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
        go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
        ])
    fig.add_vline(x=df['GO_LIVE_DATE'].iloc[0], line_width=3)
    fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
    return fig



"""Graphing real values over predicted values using XtraTrees"""
@app.callback(
    Output("pred_real_xtra", "figure"),
    Input("dropdown-pred-real_xtra", "value"),
)
def display_xtra(value):
    df = model_pred(2)
    df = df[df['brnch_cd'] == value]
    fig = go.Figure([
        go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
        go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
        ])
    fig.add_vline(x=df['GO_LIVE_DATE'].iloc[0], line_width=3)
    fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
    return fig

"""Graphing real values over predicted values using Old GBR"""
@app.callback(
    Output("pred_real_oldgbr", "figure"),
    Input("dropdown-pred-real_oldgbr", "value"),
)
def display_oldgbr(value):
    df = model_pred(4)
    df = df[df['brnch_cd'] == value]
    fig = go.Figure([
        go.Scatter(x=df['Date'], y=df['cases_hrs_pred'], name="Predicted Values"),
        go.Scatter(x=df['Date'], y=df['cases_hrs_real'], name="Real Values")
        ])
    fig.add_vline(x=df['GO_LIVE_DATE'].iloc[0], line_width=3)
    fig.add_vline(x=datetime(2020, 3, 1), line_width=3, line_color="green")
    return fig

if TESTING:
    app.run_server(debug=True)
